As with most of the projects, we start choosing a working directory to indicate R where are our files.
getwd() # Current working directory
setwd("C:/GEO77_R_course") # Set a working directory
setwd("C:\\GEO77_R_course") # another valid way to select the same working directory
setwd("C:\GEO77_R_course") # an invalid way of selecting the same working directory
or go to Session/Set Working Directory/Choose directory… and select folder
or press Ctrl+⇧Shift+H and select folder
Why?
R has the ability to load multiple file formats, which can be further extended through packages.
Reading a dataset read.table()
soil_data <- read.table("./data/soil_data.txt", header = TRUE)
names(soil_data)
## [1] "ID" "X_coord" "Y_coord" "Id"
## [5] "pH" "P" "K" "Mg"
## [9] "horiz" "upper_depth" "lower_depth" "stones"
## [13] "coarse_sand" "medium_sand" "fine_sand" "coarse_silt"
## [17] "medium_silt" "fine_silt" "clay" "soil_type_ka4"
## [21] "pH_cacl2"
str(soil_data)
## 'data.frame': 57 obs. of 21 variables:
## $ ID : int 0 1 2 3 4 5 6 7 8 9 ...
## $ X_coord : int 4428062 4427932 4428028 4428092 4428089 4428105 4428122 4428129 4428086 4427954 ...
## $ Y_coord : int 5749930 5749890 5749724 5749577 5749360 5749189 5749044 5748889 5748785 5748947 ...
## $ Id : int 110 111 112 113 114 115 116 117 118 119 ...
## $ pH : num 6.3 7 6.1 6.6 6 5.5 5.5 5.2 5.2 5.3 ...
## $ P : num 25.4 88 11.7 7.8 4.6 4.4 6.4 5.5 12.6 9.5 ...
## $ K : num 16 23 22 18 16 19 15 13 17 17 ...
## $ Mg : num 7.3 7.4 8.2 7.8 7.6 6.9 6 5.9 5.8 6.7 ...
## $ horiz : chr "Ap" "Ap" "Ap" "Ap" ...
## $ upper_depth : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lower_depth : int 3 3 3 3 3 3 3 3 3 3 ...
## $ stones : num 1.1 3.4 1.5 0.04 0.2 0.4 1.4 1.3 26.2 24.7 ...
## $ coarse_sand : num 1.4 3.2 2.1 0.8 0.6 0.8 2 2.2 8.3 8.9 ...
## $ medium_sand : num 5.8 14.7 6.5 1.1 1.1 1.5 2.9 2.9 8.7 9.7 ...
## $ fine_sand : num 27.6 6.9 3.9 1.7 1.7 2.8 3.7 3.9 8.9 9.3 ...
## $ coarse_silt : num 39.5 42.1 47.1 53.3 52.3 52.8 51.5 52.4 43.9 42.9 ...
## $ medium_silt : num 11 10.6 13.7 13.7 16.1 15.2 14.8 14.5 10 12 ...
## $ fine_silt : num 2.9 5.9 6.2 6.5 7.6 6.6 6.6 7.2 6.2 5.6 ...
## $ clay : num 11.8 16.6 20.5 22.9 20.6 20.3 18.5 16.9 14 11.6 ...
## $ soil_type_ka4: chr "Uls" "Uls" "Ut4" "Ut4" ...
## $ pH_cacl2 : num 7.33 7.26 5.87 6.79 5.77 6.3 6.46 5.79 5.21 7.25 ...
| ID | X_coord | Y_coord | Id | pH | P | K | Mg | horiz | upper_depth | lower_depth | stones | coarse_sand | medium_sand | fine_sand | coarse_silt | medium_silt | fine_silt | clay | soil_type_ka4 | pH_cacl2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4428062 | 5749930 | 110 | 6.3 | 25.4 | 16 | 7.3 | Ap | 0 | 3 | 1.10 | 1.4 | 5.8 | 27.6 | 39.5 | 11.0 | 2.9 | 11.8 | Uls | 7.33 |
| 1 | 4427932 | 5749890 | 111 | 7.0 | 88.0 | 23 | 7.4 | Ap | 0 | 3 | 3.40 | 3.2 | 14.7 | 6.9 | 42.1 | 10.6 | 5.9 | 16.6 | Uls | 7.26 |
| 2 | 4428028 | 5749724 | 112 | 6.1 | 11.7 | 22 | 8.2 | Ap | 0 | 3 | 1.50 | 2.1 | 6.5 | 3.9 | 47.1 | 13.7 | 6.2 | 20.5 | Ut4 | 5.87 |
| 3 | 4428092 | 5749577 | 113 | 6.6 | 7.8 | 18 | 7.8 | Ap | 0 | 3 | 0.04 | 0.8 | 1.1 | 1.7 | 53.3 | 13.7 | 6.5 | 22.9 | Ut4 | 6.79 |
| 4 | 4428089 | 5749360 | 114 | 6.0 | 4.6 | 16 | 7.6 | Ap | 0 | 3 | 0.20 | 0.6 | 1.1 | 1.7 | 52.3 | 16.1 | 7.6 | 20.6 | Ut4 | 5.77 |
| 5 | 4428105 | 5749189 | 115 | 5.5 | 4.4 | 19 | 6.9 | Ap | 0 | 3 | 0.40 | 0.8 | 1.5 | 2.8 | 52.8 | 15.2 | 6.6 | 20.3 | Ut4 | 6.30 |
| 6 | 4428122 | 5749044 | 116 | 5.5 | 6.4 | 15 | 6.0 | Ap | 0 | 3 | 1.40 | 2.0 | 2.9 | 3.7 | 51.5 | 14.8 | 6.6 | 18.5 | Ut4 | 6.46 |
| 7 | 4428129 | 5748889 | 117 | 5.2 | 5.5 | 13 | 5.9 | Ap | 0 | 3 | 1.30 | 2.2 | 2.9 | 3.9 | 52.4 | 14.5 | 7.2 | 16.9 | Ut3 | 5.79 |
| 8 | 4428086 | 5748785 | 118 | 5.2 | 12.6 | 17 | 5.8 | Ap | 0 | 3 | 26.20 | 8.3 | 8.7 | 8.9 | 43.9 | 10.0 | 6.2 | 14.0 | Uls | 5.21 |
| 9 | 4427954 | 5748947 | 119 | 5.3 | 9.5 | 17 | 6.7 | Ap | 0 | 3 | 24.70 | 8.9 | 9.7 | 9.3 | 42.9 | 12.0 | 5.6 | 11.6 | Uls | 7.25 |
| 10 | 4427887 | 5749114 | 120 | 6.5 | 18.9 | 22 | 7.0 | Ap | 0 | 3 | 10.50 | 8.4 | 8.0 | 8.7 | 40.3 | 11.5 | 6.1 | 17.0 | Lu | 6.27 |
| 11 | 4427836 | 5749319 | 121 | 6.1 | 7.2 | 28 | 9.1 | Ap | 0 | 3 | 1.20 | 2.6 | 2.5 | 6.4 | 48.0 | 15.2 | 4.5 | 20.8 | Ut4 | 6.81 |
| 12 | 4427950 | 5749243 | 122 | 5.5 | 7.1 | 21 | 7.0 | Ap | 0 | 3 | 33.90 | 7.3 | 8.8 | 7.3 | 44.3 | 12.5 | 4.3 | 15.5 | Uls | 5.86 |
| 13 | 4427929 | 5749496 | 123 | 6.0 | 5.1 | 19 | 7.6 | Ap | 0 | 3 | 0.20 | 0.7 | 1.5 | 2.6 | 52.0 | 15.7 | 5.8 | 21.7 | Ut4 | 6.04 |
| 14 | 4427807 | 5749625 | 124 | 6.4 | 7.9 | 22 | 9.5 | Ap | 0 | 3 | 0.60 | 1.1 | 2.2 | 4.3 | 50.6 | 13.7 | 6.0 | 22.1 | Ut4 | 6.29 |
| 15 | 4427786 | 5749495 | 125 | 6.3 | 9.5 | 34 | 7.6 | Ap | 0 | 3 | 1.00 | 2.2 | 2.6 | 4.5 | 51.5 | 13.4 | 6.0 | 19.8 | Ut4 | 5.88 |
| 16 | 4427579 | 5749612 | 126 | 6.9 | 21.1 | 25 | 5.1 | Ap | 0 | 3 | 8.70 | 2.4 | 3.0 | 7.3 | 44.6 | 12.5 | 7.6 | 22.6 | Lu | 7.05 |
| 17 | 4427668 | 5749765 | 127 | 6.7 | 17.0 | 24 | 6.6 | Ap | 0 | 3 | 12.70 | 5.7 | 7.3 | 7.7 | 42.9 | 10.3 | 4.0 | 22.1 | Lu | 7.10 |
| 18 | 4427780 | 5749847 | 128 | 6.2 | 31.3 | 31 | 9.7 | Ap | 0 | 3 | 0.50 | 1.3 | 3.2 | 5.7 | 47.8 | 11.7 | 6.4 | 23.9 | Ut4 | 5.49 |
| 19 | 4427563 | 5749885 | 129 | 7.1 | 23.6 | 29 | 5.9 | Ap | 0 | 3 | 3.30 | 2.9 | 3.8 | 5.0 | 47.6 | 11.6 | 5.1 | 24.0 | Lu | 7.36 |
| 20 | 4427461 | 5749819 | 130 | 7.0 | 24.3 | 32 | 7.3 | Ap | 0 | 3 | 0.90 | 3.2 | 3.1 | 13.3 | 42.5 | 16.2 | 5.0 | 16.7 | Uls | 7.02 |
| 21 | 4427347 | 5749787 | 131 | 6.8 | 20.1 | 27 | 6.8 | Ap | 0 | 3 | 1.70 | 10.2 | 5.5 | 31.9 | 27.8 | 7.5 | 4.2 | 12.9 | Sl4 | 6.91 |
| 22 | 4427303 | 5749690 | 132 | 7.2 | 25.0 | 21 | 3.0 | Ap | 0 | 3 | 2.60 | 28.8 | 7.5 | 23.1 | 19.1 | 5.8 | 3.8 | 11.9 | Sl3 | 7.29 |
| 23 | 4427560 | 5749481 | 133 | 7.0 | 20.5 | 18 | 4.1 | Ap | 0 | 3 | 6.60 | 21.2 | 6.9 | 40.2 | 14.8 | 2.9 | 3.1 | 10.9 | Sl3 | 6.78 |
| 24 | 4427704 | 5749249 | 134 | 6.7 | 11.1 | 20 | 7.0 | Ap | 0 | 3 | 32.80 | 22.1 | 12.2 | 16.8 | 23.3 | 8.7 | 6.9 | 10.0 | Sl3 | 6.35 |
| 25 | 4427785 | 5749011 | 135 | 7.1 | 7.6 | 17 | 8.1 | Ap | 0 | 3 | 3.20 | 0.3 | 0.7 | 4.6 | 57.4 | 14.8 | 6.6 | 15.6 | Ut3 | 7.49 |
| 26 | 4427618 | 5749083 | 136 | 6.2 | 5.9 | 22 | 8.7 | Ap | 0 | 3 | 0.60 | 0.8 | 1.7 | 6.8 | 51.5 | 15.9 | 7.2 | 16.1 | Ut3 | 6.22 |
| 27 | 4427660 | 5748900 | 137 | 6.5 | 6.1 | 19 | 10.4 | Ap | 0 | 3 | 0.30 | 0.5 | 0.8 | 5.2 | 52.5 | 14.6 | 7.5 | 18.9 | Ut4 | 6.80 |
| 28 | 4427849 | 5748858 | 138 | 7.2 | 8.5 | 16 | 7.5 | Ap | 0 | 3 | 0.20 | 0.4 | 1.0 | 6.7 | 46.1 | 17.5 | 8.6 | 19.7 | Ut4 | 7.40 |
| 29 | 4427440 | 5748925 | 139 | 5.7 | 3.4 | 16 | 8.8 | Ap | 0 | 3 | 0.20 | 0.3 | 0.8 | 5.1 | 48.1 | 17.6 | 6.1 | 22.0 | Ut4 | 5.16 |
| 30 | 4427476 | 5749111 | 140 | 6.1 | 3.6 | 14 | 8.2 | Ap | 0 | 3 | 0.20 | 0.5 | 0.8 | 4.7 | 53.1 | 13.0 | 6.6 | 21.3 | Ut4 | 6.56 |
| 31 | 4427512 | 5749330 | 141 | 6.6 | 7.0 | 20 | 7.9 | Ap | 0 | 3 | 0.40 | 0.4 | 0.7 | 4.3 | 58.7 | 15.1 | 4.2 | 16.6 | Ut3 | 7.46 |
| 32 | 4427357 | 5749507 | 142 | 7.0 | 11.4 | 18 | 5.6 | Ap | 0 | 3 | 0.30 | 1.4 | 1.9 | 11.0 | 47.1 | 15.0 | 4.9 | 18.7 | Ut4 | 7.19 |
| 33 | 4427194 | 5749687 | 143 | 7.3 | 16.5 | 26 | 6.9 | Ap | 0 | 3 | 0.20 | 1.1 | 1.4 | 7.2 | 51.4 | 15.1 | 6.6 | 17.2 | Ut4 | 7.41 |
| 34 | 4427095 | 5749673 | 144 | 7.4 | 13.7 | 18 | 8.8 | Ap | 0 | 3 | 0.20 | 0.8 | 1.1 | 5.2 | 50.7 | 14.5 | 6.8 | 20.9 | Ut4 | 7.42 |
| 35 | 4427179 | 5749506 | 145 | 7.3 | 12.2 | 19 | 6.4 | Ap | 0 | 3 | 0.20 | 0.3 | 0.9 | 7.6 | 50.3 | 16.5 | 7.7 | 16.7 | Ut3 | 7.46 |
| 36 | 4427373 | 5749355 | 146 | 7.1 | 11.0 | 24 | 7.5 | Ap | 0 | 3 | 11.10 | 2.8 | 2.8 | 16.8 | 30.4 | 9.3 | 8.5 | 29.4 | Lt2 | 7.31 |
| 37 | 4427351 | 5749188 | 147 | 6.3 | 4.1 | 19 | 8.3 | Ap | 0 | 3 | 0.70 | 1.3 | 1.7 | 7.3 | 48.4 | 14.3 | 7.6 | 19.4 | Ut4 | 6.51 |
| 38 | 4427197 | 5749195 | 148 | 5.8 | 6.7 | 25 | 7.9 | Ap | 0 | 3 | 5.00 | 2.7 | 2.9 | 13.6 | 43.7 | 11.7 | 7.0 | 18.4 | Lu | 4.93 |
| 39 | 4427135 | 5749339 | 149 | 7.2 | 7.7 | 29 | 6.2 | Ap | 0 | 3 | 15.90 | 7.8 | 5.7 | 14.6 | 26.0 | 12.2 | 9.0 | 24.7 | Ls2 | 7.43 |
| 40 | 4427010 | 5749504 | 150 | 7.3 | 11.3 | 17 | 8.8 | Ap | 0 | 3 | 0.20 | 0.4 | 0.7 | 8.1 | 46.4 | 18.6 | 9.6 | 16.2 | Ut3 | 7.33 |
| 41 | 4426918 | 5749618 | 151 | 7.2 | 13.7 | 21 | 10.2 | Ap | 0 | 3 | 0.10 | 0.7 | 1.3 | 5.7 | 47.6 | 18.7 | 6.2 | 19.8 | Ut4 | 7.32 |
| 42 | 4426723 | 5749512 | 152 | 7.3 | 15.8 | 18 | 6.8 | Ap | 0 | 3 | 1.60 | 1.0 | 0.9 | 10.7 | 47.4 | 15.2 | 6.2 | 18.6 | Ut4 | 7.44 |
| 43 | 4426881 | 5749397 | 153 | 7.3 | 13.5 | 18 | 11.3 | Ap | 0 | 3 | 0.10 | 0.3 | 1.0 | 7.3 | 60.1 | 10.4 | 5.6 | 15.3 | Ut3 | 7.44 |
| 44 | 4426941 | 5749277 | 154 | 7.2 | 15.0 | 28 | 7.7 | Ap | 0 | 3 | 1.10 | 1.7 | 2.0 | 22.0 | 40.2 | 12.4 | 6.5 | 15.2 | Uls | 7.16 |
| 45 | 4427011 | 5749134 | 155 | 6.8 | 11.0 | 23 | 8.5 | Ap | 0 | 3 | 1.20 | 1.3 | 1.5 | 15.5 | 40.9 | 15.1 | 6.6 | 19.1 | Lu | 6.44 |
| 46 | 4427115 | 5749038 | 156 | 6.4 | 21.8 | 29 | 8.4 | Ap | 0 | 3 | 9.20 | 2.5 | 2.8 | 15.6 | 42.6 | 15.1 | 6.6 | 14.8 | Uls | 6.37 |
| 47 | 4427237 | 5748957 | 157 | 5.6 | 6.4 | 20 | 8.2 | Ap | 0 | 3 | 0.50 | 0.5 | 2.5 | 8.5 | 44.5 | 17.8 | 7.5 | 18.7 | Ut4 | 5.01 |
| 48 | 4426845 | 5749062 | 158 | 7.1 | 11.4 | 18 | 8.7 | Ap | 0 | 3 | 13.80 | 1.1 | 1.2 | 40.8 | 28.7 | 7.8 | 4.7 | 15.7 | Slu | 7.39 |
| 49 | 4426762 | 5749297 | 159 | 7.2 | 14.6 | 22 | 7.1 | Ap | 0 | 3 | 0.10 | 0.4 | 0.9 | 7.4 | 49.5 | 15.3 | 7.5 | 19.0 | Ut4 | 7.41 |
| 50 | 4426567 | 5749453 | 160 | 7.2 | 19.1 | 20 | 8.1 | Ap | 0 | 3 | 1.20 | 0.7 | 0.9 | 10.3 | 54.6 | 11.9 | 8.1 | 13.5 | Ut3 | 7.09 |
| 51 | 4426632 | 5749287 | 161 | 7.1 | 12.1 | 17 | 10.8 | Ap | 0 | 3 | 0.10 | 0.4 | 1.0 | 8.8 | 48.4 | 14.9 | 8.1 | 18.4 | Ut4 | 7.18 |
| 52 | 4426676 | 5749108 | 162 | 7.3 | 14.2 | 18 | 7.5 | Ap | 0 | 3 | 5.00 | 1.3 | 2.1 | 18.2 | 43.7 | 12.3 | 4.6 | 17.8 | Lu | 7.41 |
| 53 | 4426481 | 5749154 | 163 | 7.3 | 18.2 | 17 | 8.6 | Ap | 0 | 3 | 1.60 | 1.4 | 1.6 | 19.6 | 36.1 | 12.7 | 4.8 | 23.8 | Lu | 7.43 |
| 54 | 4426426 | 5749333 | 164 | 7.3 | 25.5 | 20 | 8.0 | Ap | 0 | 3 | 1.00 | 1.4 | 1.2 | 9.7 | 52.4 | 12.3 | 5.5 | 17.5 | Ut4 | 7.44 |
| 55 | 4426255 | 5749196 | 165 | 6.9 | 14.5 | 17 | 8.5 | Ap | 0 | 3 | 0.60 | 1.1 | 1.4 | 14.5 | 47.5 | 14.6 | 5.0 | 15.9 | Ut3 | 6.06 |
| 56 | 4426093 | 5749162 | 166 | 7.2 | 21.2 | 19 | 6.2 | Ap | 0 | 3 | 1.30 | 1.3 | 1.8 | 15.1 | 50.0 | 12.5 | 4.4 | 14.9 | Ut3 | 7.31 |
write.table()R also allows to export our output/results in different formats:
write.table(soil_data, file = "output/this_is_a_copy_of_soil_data_made_with_R.csv", sep = ",",
row.names = FALSE, col.names = TRUE)
When we work with data, is important to describe our data to understand it better. There are a wide variety of statistics used to describe or summarize data in terms of central tendency, shape, position, etc.
pH <- soil_data$pH
K <- soil_data$K
mean(pH)
## [1] 6.624561
median(pH)
## [1] 6.8
min(pH)
## [1] 5.2
max(pH)
## [1] 7.4
range(pH)
## [1] 5.2 7.4
sd(pH)
## [1] 0.6467724
quantile(pH)
## 0% 25% 50% 75% 100%
## 5.2 6.2 6.8 7.2 7.4
summary(pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.200 6.200 6.800 6.625 7.200 7.400
some functions are not included in base R, but implemented via
packages. E.g. moments
library(moments)
skewness(pH)
## [1] -0.7153593
Skewness indicates if our data distribution is symmetric or not:
kurtosis(pH)
## [1] 2.315821
Kurtosis describe is how long is the tail of our data distribution:
Pearson correlation coefficient cor()
\[{r_{xy}=\frac{\sum_{i = 1}^{n}(x_i-\overline{x})(y_{i}-\overline{y})}{\sqrt{\sum_{i = 1}^{n}(x_{i}-\overline{x})^2}{\sqrt{\sum_{i = 1}^{n}(y_{i}-\overline{y})^2}}}}\]
where:
and analogously for \({\bar {y}}\)
Examples:
So now with our data…
cor(pH, K)
## [1] 0.1480799
lm()in R, the linear models are written as
lm(y ~ x, data)
Example:
# loading some example data
pH <- c(7.25, 7.91, 5.43, 8.96, 3.21, 4.32, 6.45, 7.22, 9.1, 6.77, 8.43, 5.51, 7.5, 6.64, 4.55, 2.18, 9.79, 7.65)
Corg <- c(4.23, 3.22, 3.1, 7.39, 3.11, 2.37, 5.51, 5.01, 7.26, 6.4, 6.38, 3.37, 8.29, 6.93, 2.87, 0.12, 7.57, 9.4)
data <- as.data.frame(cbind(pH, Corg))
linearModel <- lm(pH ~ Corg , data = data)
summary(linearModel)
##
## Call:
## lm(formula = pH ~ Corg, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0397 -0.9752 -0.1567 0.9637 2.5869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1757 0.7173 4.427 0.000423 ***
## Corg 0.6669 0.1264 5.276 7.53e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.289 on 16 degrees of freedom
## Multiple R-squared: 0.635, Adjusted R-squared: 0.6122
## F-statistic: 27.84 on 1 and 16 DF, p-value: 7.535e-05
plot(linearModel)
When running this command, we obtain a set of 4 plots that are out of
the scope of this “introduction”, but for the curious ones:
This is just a visual check, not an air-tight proof, so it is somewhat subjective. But it allows us to see at-a-glance if our assumption is plausible, and if not, how the assumption is violated and what data points contribute to the violation.
Source: Understanding Diagnostic Plots for Linear Regression Analysis
cor()\[R^2≡1-\frac{\sum(y_{i}-\hat{y_{i}})^2}{\sum(y_{i}-\overline{y})^2}\]
\[R^2=r×r\]
cor(data$pH, linearModel$fitted.values) *
cor(data$pH, linearModel$fitted.values)
## [1] 0.6349996
\[RMSE=\sqrt{\frac{1}{n}\sum_{i = 1}^{n}{(\hat{y_{i}}-y_{i})^2}}\]
sqrt(1 / length(rbind(linearModel$residuals * sum(rbind(linearModel$residuals) ^ 2))))
## [1] 0.2357023
Base R does not include a function for RMSE in base R, but in R we can create custom functions as follows:
function_name <- function(input_x, input_y, ...) {
some_operations
}
rmse <- function(x, y) {
sqrt(mean((x - y) ^ 2))
}
Requires input
rmse(linearModel$fitted.values, Corg)
## [1] 1.668019
boxplot()boxplot(pH, Corg, names = c("pH value", "Corg content"), ylab = "pH value and Corg content")
What does the boxplot show?
hist()hist(pH,
xlab = "pH value",
ylab = "Frequency",
main = ""
)
plot(density()plot(density(pH),
xlab = "pH value",
ylab = "Frequency",
main = ""
)
plot(x, y)plot(pH,
Corg,
xlab = "pH value",
ylab = "Corg content"
)
Directly in the graphic routines (help with ?par)
col = ...pch = ..., sizes with
cex = ...main = ..., axis label with
xlab = ..., ylab = ...xlim = ...,
ylim = ...After drawing a graphic
lines(...) or
points(...) respectively.text(...) or
mtext(...)title(...)legend(...)Output adjustment
main Title of the diagramxlab labeling of the x-axisylab labeling of the y-axisbreaks frequency classes number of barscol color filling of the barscex gradual scaling of text sizehist(pH,
xlab = "pH value",
ylab = "Frequency",
main = "",
breaks = seq(0, 10, 2),
col = "red",
cex = 0.5
)
RomeWhen working with code, it is very common that there are several ways to solve the same problem.
Exercise: Frequency distribution grain size types KA4
And now for the frequency distribution grain size types KA4!
Hint: The three different version are made with a
barplot(), plot() and hist()
If you can’t solve it by yourself, you can look at the code and try to understand it, but I encourage you to try.
# Variant A
soil_type <- as.data.frame(table(soil_data$soil_type_ka4))
barplot(soil_type$Freq, space = 0, names = soil_type$Var1,
xlab = "Soil type", ylab = "Frequency",
main = "Frequency diagram KA4 Soil types")
# Variant B
plot(factor(soil_data$soil_type_ka4),
xlab = "Soil type",
ylab = "Frequency",
main = "Frequency diagram KA4 Soil types")
# Variant C
soil_type <- as.data.frame(table(soil_data$soil_type_ka4))
hist(as.numeric(factor(soil_data$soil_type_ka4)),
breaks = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
freq=TRUE,
xaxt = "n",
xlab = "Soil type",
ylab = "Frequency",
main = "Frequency diagram KA4 Soil types")
axis(side = 1, at = seq(0.5, 8.5, 1), labels = soil_type$Var1)
When you set graphical parameters in Rstudio using this method, it is important to set the properties, plot the figures and reset the properties to the way they were before. Otherwise, every time we plot something, it will follow the set parameters.
par( # set or query graphical parameters
mfrow = c(1, 3), # 1 x 3 pictures on one plot, equivalent to mfcol = c(3, 1)
mar = c(5.1, 4.1, 4.1, 2.1), # margins as c(bottom, left, top, right)
oma = c(0, 0, 0, 0), # outer margins in lines of text as c(bottom, left, top, right)
mgp = c(3, .1, 0), # margins line for axis title, axis label and axis line
las = 0, # label axis style
cex.lab = 1, # size of the labels
cex.axis = 1, # size of the axis annotation
xpd = FALSE # If FALSE, all plotting is clipped to the plot region, if TRUE, all plotting is clipped to the figure region, and if NA, all plotting is clipped to the device region.
)
plot(…); plot(…); plot(…)
par(mfrow = c(1, 1),
mar = c(5.1, 4.1, 4.1, 2.1),
oma = c(0, 0, 0, 0),
mgp = c(3, 1, 0),
las = 0,
cex.lab = 1,
cex.axis = 1,
)
In R it is possible to export images in the most common formats and can be extended from packages. The output format must be specified and the file will be generated by default in our working directory.
tiff("filename.tiff", width = 21, height = 8, units = "cm", res = 300)
par(…)
plot(…); plot(…); plot(…)
par(…)
dev.off()
No! R and Rstudio provide several tools to help us do this:
R
will ignore all the text that start with #. Is a good
practice to comment our code, is very useful if somebody else need to
understand your code or for yourself in the future. In Rstudio, you can
comment/uncomment the selected text with Ctrl+⇧Shift+CTab ↹, and it will display a list with the possible
parameters for that function. move up ↑ and down
↓ and select with ↵ Return.F1, write ?function,
help(function) or help('function') in the
Rstudio console, and it will display the corresponding documentation for
that function in the help panel.Many functions are not included in the basic version of R. Therefore, there is an almost confusing variety of additional libraries for special applications. Examples:
ggplot2 and lattice for advanced
graphicsdplyr, reshape2 and tidyr for
data manipulationsp and sf for spatial data (shapefiles,
geopackage, etc.)raster, terra, stars for
spatial raster datacaret as wrapping function for machine-learning
librariesRQGIS and RSAGA as bridges to QGIS and
SAGA GISIn the console or the script
install.packages(ggplot2) # install a library
library(ggplot2) # loading a library
require(ggplot2)
detach(ggplot2) # Removing a loaded library
rp <- c("DescTools", "rstudioapi") # required packages
ip <- rp[!(rp %in% installed.packages()[, "Package"])] # subset packages
if(length(ip)) install.packages(ip, dependencies = TRUE) # install packages
lapply(rp, require, character.only = TRUE) # load packages
rm(np, rp) # remove vectos
Let’s start to explore the data with the tools we already know. In this case, we will use the variable P as example.
plot(soil_data$X_coord,
soil_data$Y_coord,
cex=soil_data$P*0.05)
Model: Limited representation of reality
Spatial modeling: Image of states or processes of locatable objects and their characteristic values
Data for spatial modeling:
library(sp)
soil_data_copy <- soil_data
coordinates(soil_data_copy) = ~X_coord + Y_coord
bubble(soil_data_copy,"pH")
library(gstat)
# Determination of the extension (xmin, xmax, ymin, ymax)
xmin <- min(soil_data$X_coord) - 100
xmax <- max(soil_data$X_coord) + 100
ymin <- min(soil_data$Y_coord) - 100
ymax <- max(soil_data$Y_coord) + 100
# Determination of the resolution (cellsize)
cellsize = 10
# Spanning a grid (raster format) for interpolation
grd<- expand.grid(x=seq(from= xmin, to= xmax, by= cellsize), y=seq(from= ymin, to= ymax, by= cellsize))
coordinates(grd) <- ~ x+y
gridded(grd) <- TRUE
Gstat library has for this the function
dataset.idw <- idw(soil_data$pH ~ 1, location = ~X_coord+Y_coord, soil_data, grd)
## [inverse distance weighted interpolation]
image(dataset.idw["var1.pred"],col=topo.colors(20))
points(soil_data$X_coord, soil_data$Y_coord, col ="blue")
#Define spatial objects
coordinates(soil_data) = c("X_coord", "Y_coord")
# Variogram
v <- variogram(pH ~ 1, locations = coordinates, data = soil_data, width = cellsize * 0.5)
# Variogram fit
v.fit <- fit.variogram(v, fit.method = TRUE, model = vgm(0.1, "Gau", 600))
# Output of the variogram
plot(v, model = v.fit)
krige() function.soil_data <- read.table("data/soil_data.txt", header = TRUE)
# ordinary kriging:
z <- krige(formula = pH ~ 1, locations = ~ X_coord + Y_coord, data = soil_data, newdata = grd, model = v.fit, nmax = 500)
## [using ordinary kriging]
image(z, col = topo.colors(20))
points(soil_data$X_coord, soil_data$Y_coord)
title("Ordinary kriging Interpolation of pH-Value")
# simple kriging:
z <- krige(pH ~ 1, locations = ~ X_coord + Y_coord, data = soil_data, newdata = grd, model = v.fit, nmax = 500, beta = mean(soil_data$pH))
## [using simple kriging]
image(z, col = topo.colors(20))
points(soil_data$X_coord, soil_data$Y_coord)
title("Simple kriging Interpolation of pH-Value")
Exercise:
soil_data <- read.table("data/soil_data.txt", header = TRUE)
x <- krige.cv(pH ~ 1, ~ X_coord+Y_coord, data = soil_data, maxdist = 500, nfold = 10)
cor(x$observed, x$var1.pred) * cor(x$observed, x$var1.pred) # Rsquared
## [1] 0.6249335
sqr_error <- (x$var1.pred - x$observed )^2
sqrt(sum(sqr_error)/length(sqr_error)) # RMSE
## [1] 0.3962214
This is a simple example about how to built a decision tree. For this
example, we will use the package rpart
library(rpart)
soil_model <- rpart(pH ~ P + coarse_sand, data = soil_data, method = "anova", control = rpart.control(cp = 0))
The structure of a decision/classification trees can be depicted visually, which helps to understand how the tree makes its decisions.
# Load the rpart.plot package
library(rpart.plot)
# Plot the soil_model with default settings
rpart.plot(soil_model)
A classification tree grows using a divide-and-conquer process. Each time the tree grows larger, it splits groups of data into smaller subgroups, creating new branches in the tree. This process always looks to create the split resulting in the greatest improvement to purity (subgroup homogeneity).
Classification trees tend to grow easily due to they divide the data using one variable value, producing the most pure partition each time:
Before building a more sophisticated pH model, it is important to hold out a portion of the data to simulate how well it will predict the pH of unknown data points.
As depicted in the following image, you can use 75% of the observations for training and 25% for testing the model.
The sample() function can be used to generate a random
sample of rows to include in the training set. Simply supply it the
total number of observations and the number needed for training. Then we
can use the resulting vector of row IDs to subset the samples into
training and testing datasets as:
# Determine the number of rows for training
nrow(soil_data)
## [1] 57
nrow(soil_data) * 0.75
## [1] 42.75
# Create a random sample of row IDs
sample_rows <- sample(nrow(soil_data), nrow(soil_data) * 0.75)
# Create the training dataset
soil_train <- soil_data[sample_rows, ]
# Create the test dataset
soil_test <- soil_data[-sample_rows, ]
str(soil_train)
## 'data.frame': 42 obs. of 21 variables:
## $ ID : int 46 50 53 39 37 47 22 41 14 40 ...
## $ X_coord : int 4427115 4426567 4426481 4427135 4427351 4427237 4427303 4426918 4427807 4427010 ...
## $ Y_coord : int 5749038 5749453 5749154 5749339 5749188 5748957 5749690 5749618 5749625 5749504 ...
## $ Id : int 156 160 163 149 147 157 132 151 124 150 ...
## $ pH : num 6.4 7.2 7.3 7.2 6.3 5.6 7.2 7.2 6.4 7.3 ...
## $ P : num 21.8 19.1 18.2 7.7 4.1 6.4 25 13.7 7.9 11.3 ...
## $ K : num 29 20 17 29 19 20 21 21 22 17 ...
## $ Mg : num 8.4 8.1 8.6 6.2 8.3 8.2 3 10.2 9.5 8.8 ...
## $ horiz : chr "Ap" "Ap" "Ap" "Ap" ...
## $ upper_depth : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lower_depth : int 3 3 3 3 3 3 3 3 3 3 ...
## $ stones : num 9.2 1.2 1.6 15.9 0.7 0.5 2.6 0.1 0.6 0.2 ...
## $ coarse_sand : num 2.5 0.7 1.4 7.8 1.3 0.5 28.8 0.7 1.1 0.4 ...
## $ medium_sand : num 2.8 0.9 1.6 5.7 1.7 2.5 7.5 1.3 2.2 0.7 ...
## $ fine_sand : num 15.6 10.3 19.6 14.6 7.3 8.5 23.1 5.7 4.3 8.1 ...
## $ coarse_silt : num 42.6 54.6 36.1 26 48.4 44.5 19.1 47.6 50.6 46.4 ...
## $ medium_silt : num 15.1 11.9 12.7 12.2 14.3 17.8 5.8 18.7 13.7 18.6 ...
## $ fine_silt : num 6.6 8.1 4.8 9 7.6 7.5 3.8 6.2 6 9.6 ...
## $ clay : num 14.8 13.5 23.8 24.7 19.4 18.7 11.9 19.8 22.1 16.2 ...
## $ soil_type_ka4: chr "Uls" "Ut3" "Lu" "Ls2" ...
## $ pH_cacl2 : num 6.37 7.09 7.43 7.43 6.51 5.01 7.29 7.32 6.29 7.33 ...
str(soil_test)
## 'data.frame': 15 obs. of 21 variables:
## $ ID : int 7 8 10 16 18 19 20 23 29 30 ...
## $ X_coord : int 4428129 4428086 4427887 4427579 4427780 4427563 4427461 4427560 4427440 4427476 ...
## $ Y_coord : int 5748889 5748785 5749114 5749612 5749847 5749885 5749819 5749481 5748925 5749111 ...
## $ Id : int 117 118 120 126 128 129 130 133 139 140 ...
## $ pH : num 5.2 5.2 6.5 6.9 6.2 7.1 7 7 5.7 6.1 ...
## $ P : num 5.5 12.6 18.9 21.1 31.3 23.6 24.3 20.5 3.4 3.6 ...
## $ K : num 13 17 22 25 31 29 32 18 16 14 ...
## $ Mg : num 5.9 5.8 7 5.1 9.7 5.9 7.3 4.1 8.8 8.2 ...
## $ horiz : chr "Ap" "Ap" "Ap" "Ap" ...
## $ upper_depth : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lower_depth : int 3 3 3 3 3 3 3 3 3 3 ...
## $ stones : num 1.3 26.2 10.5 8.7 0.5 3.3 0.9 6.6 0.2 0.2 ...
## $ coarse_sand : num 2.2 8.3 8.4 2.4 1.3 2.9 3.2 21.2 0.3 0.5 ...
## $ medium_sand : num 2.9 8.7 8 3 3.2 3.8 3.1 6.9 0.8 0.8 ...
## $ fine_sand : num 3.9 8.9 8.7 7.3 5.7 5 13.3 40.2 5.1 4.7 ...
## $ coarse_silt : num 52.4 43.9 40.3 44.6 47.8 47.6 42.5 14.8 48.1 53.1 ...
## $ medium_silt : num 14.5 10 11.5 12.5 11.7 11.6 16.2 2.9 17.6 13 ...
## $ fine_silt : num 7.2 6.2 6.1 7.6 6.4 5.1 5 3.1 6.1 6.6 ...
## $ clay : num 16.9 14 17 22.6 23.9 24 16.7 10.9 22 21.3 ...
## $ soil_type_ka4: chr "Ut3" "Uls" "Lu" "Lu" ...
## $ pH_cacl2 : num 5.79 5.21 6.27 7.05 5.49 7.36 7.02 6.78 5.16 6.56 ...